A birds-eye view of Recurrent Neural Networks

Jeremy Watt (dgsix, @neonwatty)

Agenda

1. Naturally ordered data and good storytelling

2. Dynamic system models

3. Putting Humpty-Dumpty back together again

0. Prologue: supervised learning

  • supervised learning: a jargon phrase, means the pursuit of relationship between input and output data
  • "relationship" means a math / computational function (a "natural law") describing input/output correspondence
  • determining appropriate relationship between input and output involves:
    • finding an appropriate model
    • appropriately tuning parameters of that model
  • simplest case: we know what model to use, and only need tune its params
  • e.g., a toy regression dataset, here a line is clearly the correct model

  • so you choose a linear model
$$ \text{model}\left(x,\Theta\right) = w_0 + xw_1 $$
  • you then press it onto the data correctly by tuning the params $\Theta = \left\{w_0,w_1\right\}$



1. Naturally ordered data and good storytelling

Naturally ordered data

  • the "standard" stuff is designed to deal with data that arrives in no particular order (data that is "i.i.d."),
  • data is assumed to just sort of pops into existance at random - like this



Naturally ordered data

  • thats fine for a lot of big problems --> object detection, genetics, many metadata problems, etc., - where the "arrival order" of data isn't important
  • but many kinds of data are naturally ordered
  • that is, the data "arrives" like this



Common examples of naturally ordered data

  • time naturally orders the arrival of tons of data
    • time-stamped metadata (e.g., customer X made purchase Y on date Z)
    • financial / economic data
    • natural signals - e.g., electric (e.g., motor output), speech

Common examples of naturally ordered data

  • time naturally orders the arrival of tons of data
    • time-stamped metadata (e.g., customer X made purchase Y on date Z)
    • financial / economic data
    • natural signals - e.g., electric (e.g., motor output), speech
  • text data e.g., English ordered from left to right

$\,$ $\,$ $\,$

Naturally ordered data and supervised learning

  • indeed in many occupations most data is naturally ordered
  • you still want to perform supervised learning tasks with such data
  • finance: predict future (price) value of a commodity
    • input: historical series data
    • output: future series values

Time series in black, training fit in blue, predictions in green.
  • speech recognition: translate raw audio to text
    • input: raw audio
    • output: text translation

(they're definitely better than cats)
  • machine translation: translate sentence or phrase in one language to another
    • input: sentence in language X
    • output: sentence in language Y

Choose your weapon

  • how do you include natural ordering in your model?
  • autoregressive modeling: process data so its no longer naturally ordered, and treat as "standard"
    • use a "sliding window" --> e.g., moving average with time series, convolution networks for speech
  • recurrent neural networks: (RNNs) go back to the drawing board and come up with new models that incorporate natural order
    • these models are collectively referred to as "dynamic systems with hidden states"
  • we're here to talk about the latter!

What is an RNN? The typical origin story

  • a typical origin story for RNNs echoed throughout the internet goes like this

    • "RNNs have infinite memory"
    • "an RNN is just a feedforward network with self-loops / multiple copies of the same network with each passing to the next in succession"
  • true statements, but not helpful if you don't already know what you're talking about

  • never pick up your weapons blade-first

Think by analogy

  • did you learn what a line / hyperplane was before using it as a regressor / classifier?
$$ \text{model}\left(x,\Theta\right) = w_0 + xw_1 $$
$$ \text{model}\left(x,\Theta\right) = w_0 + xw_1 + x^2w_2 + \cdots + w_Bx^B $$



$$ \text{model}\left(x,\Theta\right) = w_0 + \text{tanh}\left(w_{1,0} + xw_{1,1}\right)w_1 + \cdots + \text{tanh}\left(w_{1,B} + xw_{1,B}\right)w_B $$



  • lets do that here with RNNs - lets study the models first

Dynamic systems with hidden states

Examples

  • lets bash our brains in with examples
  • for simplicity lets deal with ordered input data of the form
$$ x_1,\,x_2,\,x_3,\,...,x_{t},\,... $$
  • in this sequence $x_1$ comes first, and before $x_2$, $x_2$ comes before $x_3$, etc.,



Jargon tear-down time

  • we want to understand what a "dynamic system with hidden states" is, lets take the jargon-phrase apart
  • first: what is a "dynamic system"?
  • its just a function that takes in elements of our input sequence and produces an output

Just the "dynamic systems" part

  • so a dynamic system taking in a single element generally looks like
$$ h_t = f\left(x_t\right) $$
  • to state the obvious: $h_t$ a function of $x_t$, and $x_t$ alone
  • a simple example: $h_t = -2x_t + 1$

Now the "hidden" part

  • a dynamic system with a "hidden state" is still a function that takes in elements of our input and produces an output
  • however the output doesn't just say something about the specific inputs we plug in
  • it also says something about the entire input sequence up to and including the specific element(s) we plug in
  • that means if we plug in $x_t$ the output provides some summary (a count of some aspect, or a statistic) about the entire input sequence up to that point $x_1,\,x_2,\,...\,x_t$

Example 1: Running sums

  • task: compute a running sum of our input numbers on the fly - i.e., as they arrive
  • so when the $t^{th}$ number arrives, compute a sum of the first $t$ numbers $x_1,\,x_2,\,...,x_t$
  • lazy solution: when $x_t$ arrives sum up everything
\begin{array} \ \text{sum of the first $1$ elements:} \,\,\,\,\,\,\,\,\,\,\,\,\, h_1 = x_1 \\ \text{sum of the first $2$ elements:} \,\,\,\,\,\,\,\,\,\,\,\,\, h_2 = x_1 + x_2 \\ \text{sum of the first $3$ elements:} \,\,\,\,\,\,\,\,\,\,\,\,\, h_3 = x_1 + x_2 + x_3 \\ \text{sum of the first $4$ elements:} \,\,\,\,\,\,\,\,\,\,\,\,\, h_4 = x_1 + x_2 + x_3 + x_4 \\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \vdots \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \vdots \\ \text{sum of the first $t$ elements:} \,\,\,\,\,\,\,\,\,\,\,\,\, h_{t} = x_1 + x_2 + x_3 + x_4 + \cdots + x_t \\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \vdots \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \vdots \\ \end{array}
  • obviously super wasteful

  • the right way: a simple recursion

Example 1: Running sums

  • general recursion $h_t = h_{t-1} + x_t$ very effecient way to compute running sums, only $h_{t-1}$ and $x_t$ need be stored
  • $h_t$ is called a hidden state of the system, it "summarizes" $x_1,\,x_2,\,...,x_t$ by literally summing them up
$$ h_t = h_{t-1} + x_t = x_1 + x_2 + \cdots + x_t $$
  • general formula of dynamic system with hidden state
$$ h_t = f\left(h_{t-1},x_t \right) $$
  • compare to general dynamic system without hidden state
$$ h_t = f\left(x_t\right) $$
  • general formula of dynamic system with hidden state
$$ h_t = f\left(h_{t-1},x_t \right) $$
  • $h_t$ recurses on itself, allows it to act as a "summary" or "accumulator" variable, capturing something about the entire input sequence $x_1,\,x_2,\,...,x_t$
  • exactly what that "something" is depends on how you define the function $f$

Example 2: Running mean

  • the same idea falls out when computing a running mean
  • when $x_t$ arrives we want to compute $h_t = \frac{x_1 + x_2 + \cdots + x_t}{t}$ effeciently
  • a little tinkering gets you the recursion: $h_t = \frac{t-1}{t}h_{t-1} + \frac{1}{t}x_t$
  • dynamic system with hidden state $h_t$ which summarizes $x_1,\,x_2,\,...,x_t$ as its mean

Example 3: Exponential average

  • a basic generalization of the running mean $h_t = \frac{t-1}{t}h_{t-1} + \frac{1}{t}x_t$ is called the exponential average
$$ h_t = \alpha h_{t-1} + (1 - \alpha) x_t $$

$\,\,\,$ where $0 \leq \alpha \leq 1$

  • if you roll back the recursion you see that $h_t$ summarizes the input $x_1,\,x_2,\,...,x_t$ as an exponential average
  • a very common smoother built into most electronics for signal processing purposes (see more here)
  • for one particular value of $\alpha$ it smooths like this



  • here's what happens when you mess around with the value of $\alpha$



Example 4: Running maximum

  • compute the maximum on the run, effectively, naturally leads to familiar looking dynamic system with hidden state
$$ h_t = \text{maximum}\left(h_{t-1},x_t\right) $$
  • the hidden state $h_t$ summarizes the input $x_1,\,x_2,\,...,x_t$ by its maximum value



Example 5. Count zero-crossings

  • count the number of times a signal crosses the zero axis on the run
  • can be written effeciently as a dynamic system with hidden state: $h_t = f\left(h_{t-1},x_t\right)$



Example 6. Distribution

  • heck, if you're willing to discretize you can design $h$ to accumulate an approximate distribution of past values



You get the point, enough examples

  • a simple dynamic system with hidden states, defined on an input sequence $x_1,\,x_2,...,x_t$, looks like
$$ h_t = f\left(h_{t-1},x_t\right) $$
  • $h_t$ summarizes the entire input $x_1,\,x_2,\,...,x_t$ depending on how $f$ is defined
    • e.g., as a raw sum, mean, exponential average, maximum, count of zero-crossings,...
  • this is what people mean when they say an RNN / dynamic system has "infinite memory"

3. Putting Humpty-Dumpty back together again

  • remember, with supervised learning...

Its all about models and param tuning

  • to grasp standard linear supervised learning you need to first understand what a line / hyperplane since it is the model
- once you know what a line / hyperplane is, you can press it onto data 
  • to grasp standard nonlinear supervised learning you first need to understand nonlinear shapes)
- once you know about trees, FNNs, etc., you can press them onto data (by tuning params)
  • by analogy, to grasp RNNs you need to first understand "dynamic systems with hidden states"
- once you know about them, you can press them onto data (by tuning params)
  • well, you now know what a "dynamic system with hidden states" is - so you're ready to roll
  • by using such a model, that has accumulator / summarizing states, you hope you can learn "long term dependencies" if they exist

Using dynamic systems for supervised learning

  • a general "vanilla" parameterized dynamic system with hidden states used for RNN-based supervised learning:
$$ h_t = f\left(h_{t-1},x_t\right) = f\left(w_0 + h_{t-1}w_h + x_tw_x\right) $$

where $f$ is a simple function (e.g., $\text{tanh}$, $\text{relu}$, etc., and $w_0,\,w_h,\,w_x$ are tunable parameters

  • that is, to simplify things a linear combination of inputs is passed through a nonlinearity $f\left(\cdot\right)$
  • for vector-valued input the analogous formula is used
$$ \mathbf{h}_t = f\left(\mathbf{w}_0 + \mathbf{W}_h\mathbf{h}_{t-1} + \mathbf{W}_x\mathbf{x}_t\right) $$

Example 6. Time series analysis

  • a vanilla application with a vanilla model...

Time series in black, training fit in blue, predictions in green.

Example 7. Generating text

  • you've probably seen (twitter) bots that can spit out text that sounds like certain famous people (e.g., Shakespear, politicians, etc.,)
  • one way to do this is by training a simple RNN model over a large training "corpus" of statements / writings of a particular person
  • the text of this corpus is "one-hot-encoded" into standard basis vectors as so...

  • and then a model is trained on top, and learns to "mimic" their manner of writing / speaking
  • for example, an H.G. Wells bot would be trained on his written books

  • a small part of one-hot-encoded corpus would look like this

  • notice: the hidden state here summarizes a distribution of prior characters (since everything is one-hot-encoded), akin to Example 5
  • a exponential average moving forward through the document leaves distributional heat traces that look like this

  • and such a trained RNN model produces results like these...

4. Epilogue

  • a number of other topics to explore
    • Optimization: Swift to evaluate, slow to train, many tricks
    • Modeling: Very flexible modeling wise - can be used to model a wide array of weird things
    • Deeper networks: RNNs with many layers of hidden states (we discussed "shallow" RNNs today)